Introduction

In this project we examine the stock prices of three major tech companies, Apple, Facebook, and Google, from 2012 to 2018. The data, obtained from Yahoo Finance, contain the trading dates, the daily opening S&P 500 index, and the daily closing price of each company.

Given the nature of the stock market, there may be temporal structure in stock prices. We took the closing price as the response variable and fit models with the S&P 500 index as the mean structure, both to predict each company's daily closing price and to understand the temporal structure that remains.

Since all three companies share the same timeline, with observations at the same time points, we fit a separate model for each company. We first present some exploratory data analysis for all three companies, then fit simpler models with no temporal structure, and finally fit and evaluate temporal models for each company. For the temporal models specifically, we try two approaches: automatically fitted ARIMA models and Gaussian process models. We then compare the performance of all three types of models.

Lastly, we address our questions of interest: Can we predict the closing price of a stock from the S&P 500 index alone? Is there temporal dependency in the closing prices? Do the companies differ, or do they share similar trends and structures in their stocks? Can such a model, potentially with its temporal structure, be generalized to other companies? We also discuss the adequacy of and potential problems with the models, and offer suggestions for developing this project further.

Exploratory Data Analysis

summary(df)
##       Date                 Open         fb_close       google_close   
##  Min.   :2012-05-18   Min.   :1278   Min.   : 17.73   Min.   : 277.7  
##  1st Qu.:2013-11-11   1st Qu.:1765   1st Qu.: 49.50   1st Qu.: 504.1  
##  Median :2015-05-06   Median :2033   Median : 81.73   Median : 581.6  
##  Mean   :2015-05-06   Mean   :2003   Mean   : 90.37   Mean   : 641.2  
##  3rd Qu.:2016-10-25   3rd Qu.:2178   3rd Qu.:124.94   3rd Qu.: 779.6  
##  Max.   :2018-04-20   Max.   :2867   Max.   :193.09   Max.   :1175.8  
##   apple_close    
##  Min.   : 55.79  
##  1st Qu.: 81.64  
##  Median :105.80  
##  Mean   :108.11  
##  3rd Qu.:126.81  
##  Max.   :181.72
#time trend
p1 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = fb_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Facebook")
p2 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = apple_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Apple")
p3 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = google_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Google")
p4 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = Open)) +
  geom_line() +
  ggtitle("S&P 500 Index Open")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

#scatter plot
p1 = ggplot(data = df, aes(x = Open, y = fb_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Facebook")
p2 = ggplot(data = df, aes(x = Open, y = apple_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Apple")
p3 = ggplot(data = df, aes(x = Open, y = google_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Google")
grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)

We focus on three variables: Date, Open (the opening S&P 500 index), and the closing stock price of each of the three companies.

The line plots of the closing prices of the three companies against date, as well as the S&P 500 index against date, all show a roughly linear trend over time. This is expected, since the stock market inflated continuously over this period, and these tech companies in particular were developing rapidly. Because the S&P 500 index shows a similar increasing linear trend, using it as a predictor/mean structure lets us de-trend the closing prices and then examine any remaining temporal structure.

We then look at scatter plots of each company's closing price against the opening S&P 500 index. Although Apple shows some discrepancy, the relationship is roughly linear for all three companies, further supporting the use of the S&P 500 index as the mean structure.

Method 1: Simple Linear Models

The first method fits a separate simple linear model for each company, predicting the daily closing price from the S&P 500 opening index and the date.
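Several later chunks call an `rmse` helper and use index bounds `lower.b` and `upper.b` that are never defined in the report. A minimal sketch, assuming `rmse` is the usual root mean squared error and the bounds mark off the final year of data (2017-04-21 to 2018-04-20, the same window used in the Gaussian process section):

```r
# Hypothetical helpers, not shown in the report.

# Root mean squared error between actual and predicted values.
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Assumed reconstruction of the index bounds used to score predictions
# on the final year of trading days:
# lower.b = which(df$Date == as.Date("2017-04-21"))
# upper.b = which(df$Date == as.Date("2018-04-20"))
```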

#Apple
apple_lm = lm(apple_close ~ Open + Date, data = df)
summary(apple_lm)
## 
## Call:
## lm(formula = apple_close ~ Open + Date, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.735 -12.223  -0.845  10.784  35.642 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -67.379897  27.767662  -2.427   0.0154 *  
## Open          0.076482   0.003608  21.197   <2e-16 ***
## Date          0.001346   0.002091   0.644   0.5197    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.17 on 1487 degrees of freedom
## Multiple R-squared:  0.8017, Adjusted R-squared:  0.8014 
## F-statistic:  3005 on 2 and 1487 DF,  p-value: < 2.2e-16
df$apple_naive_pred = predict(apple_lm, newdata = df)
df$apple_naive_residual = df$apple_close - df$apple_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = apple_close, color = "red")) +
  geom_line(aes(y = apple_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Apple Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$apple_close, df$apple_naive_pred)
## [1] 14.15821
rmse(df$apple_close[lower.b:upper.b], df$apple_naive_pred[lower.b:upper.b])
## [1] 11.45351
#Facebook
fb_lm = lm(fb_close ~ Open + Date, data = df)
summary(fb_lm)
## 
## Call:
## lm(formula = fb_close ~ Open + Date, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.215  -5.594  -0.558   4.397  33.428 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.024e+03  1.689e+01  -60.62   <2e-16 ***
## Open         2.239e-02  2.195e-03   10.20   <2e-16 ***
## Date         6.458e-02  1.272e-03   50.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.621 on 1487 degrees of freedom
## Multiple R-squared:  0.969,  Adjusted R-squared:  0.9689 
## F-statistic: 2.321e+04 on 2 and 1487 DF,  p-value: < 2.2e-16
df$fb_naive_pred = predict(fb_lm, newdata = df)
df$fb_naive_residual = df$fb_close - df$fb_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = fb_close, color = "red")) +
  geom_line(aes(y = fb_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Facebook Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$fb_close, df$fb_naive_pred)
## [1] 8.612343
rmse(df$fb_close[lower.b:upper.b], df$fb_naive_pred[lower.b:upper.b])
## [1] 11.61579
#Google
google_lm = lm(google_close ~ Open + Date, data = df)
summary(google_lm)
## 
## Call:
## lm(formula = google_close ~ Open + Date, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -153.400  -27.964    8.295   33.648  142.760 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.451e+03  1.061e+02  -32.53   <2e-16 ***
## Open         2.014e-01  1.378e-02   14.61   <2e-16 ***
## Date         2.228e-01  7.987e-03   27.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.14 on 1487 degrees of freedom
## Multiple R-squared:  0.9377, Adjusted R-squared:  0.9376 
## F-statistic: 1.12e+04 on 2 and 1487 DF,  p-value: < 2.2e-16
df$google_naive_pred = predict(google_lm, newdata = df)
df$google_naive_residual = df$google_close - df$google_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = google_close, color = "red")) +
  geom_line(aes(y = google_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Google Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$google_close, df$google_naive_pred)
## [1] 54.08638
rmse(df$google_close[lower.b:upper.b], df$google_naive_pred[lower.b:upper.b])
## [1] 61.21239

Method 2: ARIMA Time Series Models

We then proceeded to fit ARIMA time series models for each of the three companies of interest. The model is specified as follows:

\[ Close_t = \beta_1 \, Open_t + \eta_t, \qquad \eta_t \sim ARIMA(p, d, q) \times (P, D, Q)_s \]

That is, the S&P 500 opening index enters the model as a mean structure (a regression term), while the errors follow an ARIMA process. We first fit auto.arima models for all three companies and examined their residual plots for possible improvements.

The residual plot for Apple after fitting the automatically selected ARIMA(2,1,0) model looks approximately random, without salient spikes, so we kept that model. For Facebook, close examination of the residual plot after fitting the automatic ARIMA(2,1,2) revealed spikes at lag 30, so we tried adding a seasonal AR or MA term with period 30. However, the residual plots after adding seasonal terms showed no sign of improvement, and the autocorrelation at lag 30 is only about 0.1, which is relatively small, so we stuck with the ARIMA(2,1,2) from auto.arima for Facebook. For Google, the ACF/PACF plots of the automatic ARIMA(2,1,2) residuals showed spikes and a potential periodic trend, so we tried several ways of adding seasonal terms in an attempt to get a better model. Again, the residual plots did not improve and the autocorrelation at lag 30 is only about 0.1, so taking all of the above into account, we kept the auto.arima result for Google as well.

The plots of the actual closing price and the fitted values over time are fairly close at each time stamp for all three companies, suggesting that the ARIMA models do a decent job of predicting Close. The root mean squared errors (RMSE) for Apple and Facebook are both in the range 1.6 to 1.7, fairly small, especially compared with the earlier simple linear regressions. The RMSE for Google's ARIMA model is higher (9.3) but still suggests acceptable prediction performance.

#Apple
apple.ts = auto.arima(df %>% select(apple_close), xreg = df$Open, seasonal = TRUE)
apple.ts %>% summary()
## Series: df %>% select(apple_close) 
## Regression with ARIMA(2,1,0) errors 
## 
## Coefficients:
##          ar1      ar2    xreg
##       0.0092  -0.0491  0.0053
## s.e.  0.0303   0.0262  0.0033
## 
## sigma^2 estimated as 2.676:  log likelihood=-2844.26
## AIC=5696.52   AICc=5696.54   BIC=5717.74
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.05806906 1.633794 1.149067 0.03724077 1.097862 0.9989548
##                      ACF1
## Training set -0.001345197
ggtsdisplay(apple.ts$residuals, main = "ARIMA(2,1,0)")

#Residual plot looks good and we will go with the result.
df$apple_ts_pred = c(apple.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = apple_close, color = "red")) +
  geom_line(aes(y = apple_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Apple Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,0)")

rmse(df$apple_close, df$apple_ts_pred)
## [1] 1.633794
rmse(df$apple_close[lower.b:upper.b], df$apple_ts_pred[lower.b:upper.b])
## [1] 2.192759
#Facebook
facebook.ts = auto.arima(df %>% select(fb_close), xreg = df$Open, seasonal = TRUE)
facebook.ts %>% summary()
## Series: df %>% select(fb_close) 
## Regression with ARIMA(2,1,2) errors 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2   drift    xreg
##       -0.0576  0.8242  0.0539  -0.8907  0.0860  0.0021
## s.e.   0.0700  0.0682  0.0574   0.0571  0.0306  0.0031
## 
## sigma^2 estimated as 2.82:  log likelihood=-2881.82
## AIC=5777.64   AICc=5777.72   BIC=5814.78
## 
## Training set error measures:
##                        ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.002329131 1.675428 1.120526 -0.1057361 1.517573 0.9969723
##                       ACF1
## Training set -0.0004667628
ggtsdisplay(facebook.ts$residuals, main = "ARIMA(2,1,2)")

#After looking at the residual plot, we found spikes at period 30 and tried to see if having a seasonal AR or MA trend makes the model better.
facebook.try1 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 0, 1),period = 30))
ggtsdisplay(facebook.try1$residuals)

facebook.try2 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 0, 0),period = 30))
ggtsdisplay(facebook.try2$residuals)

#However, the residual plots show that adding seasonal terms does not give better performance, and the autocorrelation at lag 30 is only about 0.1, which is relatively small. So we stuck with the result from auto.arima.
df$fb_ts_pred = c(facebook.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = fb_close, color = "red")) +
  geom_line(aes(y = fb_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Facebook Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,2)")

rmse(df$fb_close, df$fb_ts_pred)
## [1] 1.675428
rmse(df$fb_close[lower.b:upper.b], df$fb_ts_pred[lower.b:upper.b])
## [1] 2.601592
#Google
google.ts = auto.arima(df %>% select(google_close), xreg = df$Open, seasonal = TRUE)
google.ts %>% summary()
## Series: df %>% select(google_close) 
## Regression with ARIMA(2,1,2) errors 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2   drift    xreg
##       -0.5939  -0.8918  0.5656  0.8422  0.4368  0.0666
## s.e.   0.1005   0.0927  0.1259  0.1076  0.2345  0.0197
## 
## sigma^2 estimated as 87.05:  log likelihood=-5435.1
## AIC=10884.2   AICc=10884.28   BIC=10921.34
## 
## Training set error measures:
##                      ME     RMSE      MAE         MPE    MAPE      MASE
## Training set 0.02345534 9.307922 6.106312 -0.01130765 0.96099 0.9965447
##                     ACF1
## Training set 0.008764646
ggtsdisplay(google.ts$residuals, main = "ARIMA(2,1,2)")

#From ACF/PACF plots, we found spikes and potential periodic trend so we tried to add seasonal terms to the model.
google.try1 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 0),period = 7))
ggtsdisplay(google.try1$residuals)

google.try2 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 1, 0),period = 7))
ggtsdisplay(google.try2$residuals)

google.try3 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 1),period = 7))
ggtsdisplay(google.try3$residuals)

#However, after trying different combinations of seasonal terms, the ACF/PACF did not improve much, and the autocorrelation is relatively low. So we went with the result from the auto.arima function.
df$google_ts_pred = c(google.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = google_close, color = "red")) +
  geom_line(aes(y = google_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Google Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,2)")

rmse(df$google_close, df$google_ts_pred)
## [1] 9.307922
rmse(df$google_close[lower.b:upper.b], df$google_ts_pred[lower.b:upper.b])
## [1] 13.79383

Method 3: Gaussian Process

\[ Close_t = \beta_0 + \beta_1 Open_t + w_t, \qquad w_t \sim GP(0, \Sigma), \qquad \Sigma_{ij} = \sigma^2 \exp\{-(l \, d_{ij})^2\} + \sigma^2_w \, 1\{i = j\} \] where \(d_{ij}\) is the distance in days between observations \(i\) and \(j\), i.e. a squared exponential covariance with a nugget. Because it takes a long time for JAGS to run on large datasets, we subset the data to one year, from 2017-04-21 to 2018-04-20.
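The `subset` data frame used throughout this section is built from `df` by this unshown filtering step; a minimal sketch, where the helper name `subset_year` is ours:

```r
# Hypothetical reconstruction of the subsetting step (not shown in the
# report): keep one year of rows by Date so the JAGS fit stays tractable.
subset_year = function(data, from = "2017-04-21", to = "2018-04-20") {
  data[data$Date >= as.Date(from) & data$Date <= as.Date(to), , drop = FALSE]
}

# subset = subset_year(df)  # roughly 252 trading days
```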

### Apple

apple_emp_cloud = subset %>% emp_semivariogram(apple_naive_residual,Date)
apple_emp = rbind(
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=5)  %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=20) %>% mutate(binwidth="binwidth=20"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=40)   %>% mutate(binwidth="binwidth=40"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)
apple_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Apple (binned)")+
  facet_wrap(~binwidth, nrow=2)

apple_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))
  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }
  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }
  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(10, 1/25) T(0,)
  sigma2   ~ dnorm(50, 1/25) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
if (file.exists("apple_gp_jags.Rdata")) {
  load(file="apple_gp_jags.Rdata")
} else {
  m = rjags::jags.model(
    textConnection(apple_gp_exp_model), 
    data = list(
      y = subset$apple_close,
      x = subset$Open,
      d = dist(subset$Date) %>% as.matrix(),
      N = nrow(subset),
      coef = coef(apple_lm)
    ),
    quiet = TRUE
  )
  update(m, n.iter=2000)
  exp_cov_coda = rjags::coda.samples(
    m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
    n.iter=2000, thin=10
  )
  save(exp_cov_coda, file="apple_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
|.variable | post_mean|  post_med| post_lower| post_upper|
|:---------|---------:|---------:|----------:|----------:|
|beta[1]   |  59.63543|  55.80959|   37.46380|   89.79846|
|beta[2]   |   0.03976|   0.04138|    0.02815|    0.04903|
|l         |   0.09253|   0.09374|    0.07121|    0.11468|
|sigma2    |  48.76159|  48.84577|   39.31196|   56.76181|
|sigma2_w  |   2.96200|   2.92310|    2.34422|    3.71781|
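The prediction code below calls two helpers that are not defined in the report: `sq_exp_cov`, the squared exponential covariance implied by the JAGS model, and `post_summary`, which summarizes a matrix of posterior draws. A minimal sketch consistent with how they are used:

```r
# Hypothetical reconstructions of helpers used below (not shown in the report).

# Squared exponential covariance applied to a distance matrix d, matching
# the JAGS model: sigma2 * exp(-(l * d)^2), with a nugget sigma2_w added
# where the distance is exactly zero (the diagonal).
sq_exp_cov = function(d, sigma2, l, sigma2_w = 0) {
  sigma2 * exp(-(l * d)^2) + sigma2_w * (d == 0)
}

# Column-wise posterior summaries of a draws matrix (rows = draws,
# columns = prediction points).
post_summary = function(draws) {
  data.frame(
    post_mean  = apply(draws, 2, mean),
    post_med   = apply(draws, 2, median),
    post_lower = apply(draws, 2, quantile, probs = 0.025),
    post_upper = apply(draws, 2, quantile, probs = 0.975)
  )
}
```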
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(apple_gp_resid = apple_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$apple_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
apple_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
apple_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=apple_close)) +
  geom_ribbon(data=apple_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=apple_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Apple Daily Close Stock Price")

#use the posterior mean as the Bayesian prediction to calculate RMSE
subset$apple_gp_pred = apple_pred_df_bayes$post_mean
rmse(subset$apple_close, subset$apple_gp_pred)
## [1] 1.634382
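For reference, the conditional simulation above implements the standard multivariate normal conditioning formulas: writing \(o\) for the observed points and \(p\) for the prediction points,

\[ \begin{aligned} \mu_{cond} &= \mu_p + \Sigma_{po} \, \Sigma_{o}^{-1} (y - \mu_o) \\ \Sigma_{cond} &= \Sigma_p - \Sigma_{po} \, \Sigma_{o}^{-1} \Sigma_{op} \end{aligned} \]

which correspond to `cond_mu` and `cond_cov` in the code; draws are then generated by multiplying standard normal vectors by the Cholesky factor of \(\Sigma_{cond}\). The same steps are repeated for Facebook and Google.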

### Facebook

fb_emp_cloud = subset %>% emp_semivariogram(fb_naive_residual,Date)
fb_emp = rbind(
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=1)  %>% mutate(binwidth="binwidth=1"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=15)   %>% mutate(binwidth="binwidth=15"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)
fb_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Facebook (binned)")+
  facet_wrap(~binwidth, nrow=2)

fb_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))
  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }
  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }
  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(10, 1/25) T(0,)
  sigma2   ~ dnorm(390, 1/200) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
if (file.exists("fb_gp_jags.Rdata")) {
  load(file="fb_gp_jags.Rdata")
} else {
  m = rjags::jags.model(
    textConnection(fb_gp_exp_model), 
    data = list(
      y = subset$fb_close,
      x = subset$Open,
      d = dist(subset$Date) %>% as.matrix(),
      N = nrow(subset),
      coef = coef(fb_lm)
    ),
    quiet = TRUE
  )
  update(m, n.iter=2000)
  exp_cov_coda = rjags::coda.samples(
    m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
    n.iter=2000, thin=10
  )
  save(exp_cov_coda, file="fb_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
|.variable | post_mean|  post_med| post_lower| post_upper|
|:---------|---------:|---------:|----------:|----------:|
|beta[1]   | 138.00653| 137.23548|   92.06659|  190.34675|
|beta[2]   |   0.01138|   0.01184|   -0.00740|    0.03109|
|l         |   0.06331|   0.06302|    0.05771|    0.07039|
|sigma2    | 385.32748| 384.58300|  357.71997|  417.85007|
|sigma2_w  |   4.45172|   4.45002|    3.73450|    5.41662|
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(fb_gp_resid = fb_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$fb_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
fb_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
fb_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=fb_close)) +
  geom_ribbon(data=fb_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=fb_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Facebook Daily Close Stock Price")

#use the posterior mean as the Bayesian prediction to calculate RMSE
subset$fb_gp_pred = fb_pred_df_bayes$post_mean
rmse(subset$fb_close, subset$fb_gp_pred)
## [1] 1.970619

### Google

google_emp_cloud = subset %>% emp_semivariogram(google_naive_residual,Date)
google_emp = rbind(
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=1)  %>% mutate(binwidth="binwidth=1"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=15)   %>% mutate(binwidth="binwidth=15"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)
google_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Google (binned)")+
  facet_wrap(~binwidth, nrow=2)

google_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))
  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }
  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }
  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(100, 1/100) T(0,)
  sigma2   ~ dnorm(700, 1/400) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
if (file.exists("google_gp_jags.Rdata")) {
  load(file="google_gp_jags.Rdata")
} else {
  m = rjags::jags.model(
    textConnection(google_gp_exp_model), 
    data = list(
      y = subset$google_close,
      x = subset$Open,
      d = dist(subset$Date) %>% as.matrix(),
      N = nrow(subset),
      coef = coef(google_lm)
    ),
    quiet = TRUE
  )
  update(m, n.iter=2000)
  exp_cov_coda = rjags::coda.samples(
    m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
    n.iter=2000, thin=10
  )
  save(exp_cov_coda, file="google_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
|.variable |  post_mean|   post_med| post_lower| post_upper|
|:---------|----------:|----------:|----------:|----------:|
|beta[1]   | -498.77787| -500.77321| -505.52283| -468.68757|
|beta[2]   |    0.58398|    0.58450|    0.57281|    0.58877|
|l         |    0.09541|    0.09349|    0.07905|    0.12334|
|sigma2    |  697.70238|  697.63588|  663.18587|  732.68595|
|sigma2_w  |  126.16484|  126.77453|  113.52041|  139.03941|
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(google_gp_resid = google_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$google_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
google_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
google_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=google_close)) +
  geom_ribbon(data=google_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=google_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Google Daily Close Stock Price")

#use the posterior mean as the Bayesian prediction to calculate RMSE
subset$google_gp_pred = google_pred_df_bayes$post_mean
rmse(subset$google_close, subset$google_gp_pred)
## [1] 12.26636

Conclusion

subset %>% summarise(method = "rmse",
                     google.lm = rmse(google_close, google_naive_pred),
                     google.arima = rmse(google_close, google_ts_pred),
                     google.gp = rmse(google_close, google_gp_pred),
                     fb.lm = rmse(fb_close, fb_naive_pred),
                     fb.arima = rmse(fb_close, fb_ts_pred),
                     fb.gp = rmse(fb_close, fb_gp_pred),
                     apple.lm = rmse(apple_close, apple_naive_pred),
                     apple.arima = rmse(apple_close, apple_ts_pred),
                     apple.gp = rmse(apple_close, apple_gp_pred))
##   method google.lm google.arima google.gp    fb.lm fb.arima    fb.gp
## 1   rmse  61.21239     13.79383  12.26636 11.61579 2.601592 1.970619
##   apple.lm apple.arima apple.gp
## 1 11.45351    2.192759 1.634382
  1. The time series trend in Close does exist and requires a time-series model to explain.

  2. A linear model cannot fully explain the variability over time, while the ARIMA model and the Gaussian process do a much better job: the RMSE decreases as we move from the linear model to ARIMA and then to the Gaussian process.

  3. The Gaussian process outperforms ARIMA. A likely reason is that weekends break the evenly spaced, discrete-time assumption behind ARIMA; the Gaussian process measures the distance between days continuously rather than discretely, and therefore constructs a more expressive correlation between days.

  4. A weakness of the Gaussian process is fitting time. Fitting 1000 observations never finished, and this inability to handle large-scale data limits its usefulness even though it achieves the highest precision here. ARIMA, by contrast, trains quickly while keeping the increase in error acceptable.

  5. Validation and testing: because we evaluate on the training data, the models may overfit, which would cause problems in extrapolation.

  6. Overall, our models successfully explain some of the time-series variability.

Discussion

  1. Fitting more data into a Gaussian process: parallel computing or a more powerful CPU could be one route to a solution. Trying an RNN may also help, since a Gaussian process corresponds to an infinitely wide neural network.

  2. If the time series were strictly discrete (evenly spaced), an ARIMA model might be preferable in both complexity and prediction.